Large-Scale Analysis of Zipf’s Law in English Texts
Authors
Abstract
Despite being a paradigm of quantitative linguistics, Zipf's law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. The current support for Zipf's law in texts can therefore be summarized as anecdotal. We try to solve these issues by studying three different versions of Zipf's law and fitting them to all available English texts in the Project Gutenberg database (more than 30 000 texts). To do so, we use state-of-the-art tools in fitting and goodness-of-fit testing, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf's law, a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level) over the whole domain of frequencies (from 1 to the maximum value) and with only one free parameter (the exponent).
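To make the one-parameter form concrete, the following is a minimal sketch, not the authors' procedure. It assumes the complementary cumulative distribution of word frequencies is S(n) = P(N ≥ n) = n^(-alpha) for n ≥ 1, which implies the probability mass p(n) = n^(-alpha) - (n+1)^(-alpha), and it estimates the exponent by maximum likelihood with a simple ternary search. The function names, the symbol `alpha`, and the search bounds are illustrative assumptions, and no goodness-of-fit test is included.

```python
import math
from collections import Counter


def frequency_counts(tokens):
    """Map each word frequency n to the number of distinct words occurring n times."""
    word_freq = Counter(tokens)          # word -> number of occurrences
    return Counter(word_freq.values())   # frequency n -> number of words with it


def log_pmf(n, alpha):
    """log p(n) for p(n) = n**-alpha - (n + 1)**-alpha (assumed model).

    Written as -alpha*log(n) + log(1 - (1 + 1/n)**-alpha) for numerical stability.
    """
    return -alpha * math.log(n) + math.log(-math.expm1(-alpha * math.log1p(1.0 / n)))


def log_likelihood(alpha, freq_counts):
    """Log-likelihood of the exponent given the frequency-of-frequencies table."""
    return sum(count * log_pmf(n, alpha) for n, count in freq_counts.items())


def fit_alpha(freq_counts, lo=0.01, hi=10.0, tol=1e-8):
    """Maximum-likelihood exponent via ternary search.

    The log-likelihood is concave in alpha for this model, so a ternary search
    on [lo, hi] converges to the maximum inside the (assumed) bounds.
    """
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, freq_counts) < log_likelihood(m2, freq_counts):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


if __name__ == "__main__":
    # Toy input only; a real text would be tokenized into many thousands of words.
    tokens = "a a a b b c d".split()
    counts = frequency_counts(tokens)
    print("estimated exponent:", fit_alpha(counts))
```

An estimate like this says nothing about whether the power law is an acceptable description of a given text; that question requires a separate goodness-of-fit test of the kind the abstract refers to.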
Similar resources
On the Ranking Property and Underlying Dynamics of Complex Systems
Ranking procedures are widely used to describe phenomena in many different fields of the social and natural sciences, e.g., sociology, economics, linguistics, demography, physics, biology, etc. In this dissertation, we study the ranking properties and underlying dynamics embedded in complex systems. In particular, we focused on the scores/prizes ranking in sports systems and the wo...
Zipf’s law and the grammar of languages: A quantitative study of Old and Modern English parallel texts
This paper reports a quantitative analysis of the relationship between word frequency distributions and morphological features in languages. We analyze a commonly-observed process of historical language change: The loss of inflected forms in favour of ‘analytic’ periphrastic constructions. These tendencies are observed in parallel translations of the Book of Genesis in Old English and Modern En...
Fractal geometry of texts: An initial application to the works of Shakespeare
It has been demonstrated that there is a geometrical order in text structures. Fractal geometry, as a modern mathematical approach and a new geometrical standpoint on natural objects including both processes and structures, is here employed for textual analysis. For this first study, the works of William Shakespeare were chosen as the most important items in Western literature. By counting the ...
Zipf's Law and Random Texts
Random-text models have been proposed as an explanation for the power law relationship between word frequency and rank, the so-called Zipf’s law. They are generally regarded as null hypotheses rather than models in the strict sense. In this context, recent theories of language emergence and evolution assume this law as a priori information with no need of explanation. Here, random texts and rea...
Patterns in syntactic dependency networks from authored and randomised texts
The syntactic relationships between words allow a communicator to express a virtually endless array of thoughts by a finite set of elements. The co-occurrence of words in a sentence reflects the syntactic dependency between words, and can be represented as a directed graph. In this account we compiled the grammar dependency networks of 86 texts from 11 well-known English authors. In an analysis...
Journal title:
Volume 11, Issue
Pages -
Publication date: 2016